Abstract

We analytically derive the geometrical structure of the weight space in multilayer neural networks (MLN), in terms of the volumes of couplings associated to the internal representations of the training set. Focusing on the parity and committee machines, we deduce their learning and generalization capabilities, both reinterpreting some known properties and finding new exact results. The relationship between our approach and information theory as well as the Mitchison-Durbin calculation is established. Our results are exact in the limit of a large number of hidden units, showing that MLN are a class of exactly solvable models with a simple interpretation of replica symmetry breaking.

Memorization, rule inference or information processing by a neural network may be seen as a complicated selection of one part of its whole weight space. Statistical mechanics has permitted a quantitative study of this selection process for the simple perceptron by providing a measure on the weight space resulting from learning. In particular, the purely geometrical meaning of the spin-glass order parameter has been shown to emerge naturally in this context. These techniques have been successfully applied to simple models of multilayer neural networks (MLN) to compute their storage capacities and generalization errors. However, a geometrical picture of the weight space of MLN, and thus a unique "conceptual" frame allowing for the interpretation of their physical and computational behaviour, has been lacking so far.

In this letter we analytically derive such a geometrical structure for MLN and show how it is hidden in the usual Gardner approach. The study of the distribution of the volumes of couplings associated to the internal representations of the training set leads to a simple geometrical interpretation of replica symmetry breaking (RSB) and allows one to deduce the learning and generalization properties of the network. Moreover, we show the key importance of this issue for analyzing the encoding of information provided by the internal representations in the intermediate layers of MLN, by establishing a correspondence with information theory and the Mitchison-Durbin calculation. For the storage problem, we focus upon the volumes giving the dominant contribution to Gardner's total volume, whose entropy $\mathcal{N}_D$ (the logarithm of their number divided by $N$) is smaller than the entropy $\mathcal{N}_R$ of all the non-empty volumes. For the parity and committee machines with $K$ ($\gg 1$) hidden units, $\mathcal{N}_D$ and $\mathcal{N}_R$ both vanish at $\alpha = \log K/\log 2$ and $\alpha \simeq \frac{16}{\pi}\sqrt{\log K}$ (so far unknown) respectively. Our results are shown to be exact in this limit and are likely to coincide with the storage capacities of both machines. For finite $K$, we give a general geometrical interpretation of RSB together with numerical results in the case $K = 3$. The inference of a learnable rule is studied along the same lines. We first reinterpret recent results concerning the Bayesian learning of a rule by a parity machine. We then explain the smoothness of the generalization curve of the committee machine near its Vapnik-Chervonenkis (VC) dimension $d_{VC} \sim \sqrt{\log K}$ and conjecture a cross-over to lower generalization error for $\alpha \sim K$.

In the following, we shall consider tree-like MLN, composed of $K$ non-overlapping perceptrons with real-valued weights $J_{\ell i}$ and connected to $K$ sets of independent inputs $\xi_{\ell i}$ ($\ell = 1,\ldots,K$, $i = 1,\ldots,N/K$). The output $\sigma$ of the network is a binary function $f(\tau_1,\ldots,\tau_K)$ of the cells $\tau_\ell = \mathrm{sign}(\sum_i J_{\ell i}\,\xi_{\ell i})$ in the first hidden layer. The set $\{\tau_\ell\}$ will hereafter be called the internal representation of the input pattern $\{\xi_{\ell i}\}$. For the parity and committee machines, the decoder functions $f$ are respectively $\prod_\ell \tau_\ell$ and $\mathrm{sign}(\sum_\ell \tau_\ell)$. The training set to be stored in the network includes $P = \alpha N$ patterns $\{\xi^\mu_{\ell i}\}$ and their corresponding outputs $\sigma^\mu$ ($\mu = 1,\ldots,P$). For simplicity, both patterns and outputs are drawn according to the binary unbiased distribution law. In order to store the patterns, one must find a suitable set of internal representations $T = \{\tau^\mu_\ell\}$ with a corresponding non-zero volume

$$V_T = \int \prod_{\ell,i} dJ_{\ell i}\; \prod_{\mu} \theta\Big(\sigma^\mu f(\{\tau^\mu_\ell\})\Big)\; \prod_{\mu,\ell} \theta\Big(\tau^\mu_\ell \sum_i J_{\ell i}\,\xi^\mu_{\ell i}\Big) \tag{1}$$

where $\theta(\ldots)$ is the Heaviside function and the integral over the weights fulfills $\int \prod_{\ell,i} dJ_{\ell i} = 1$. Gardner's total volume is simply $V_G = \sum_T V_T$, and the critical capacity of the network is the value $\alpha_c$ of the maximal size of the training set such that $\overline{\log V_G}$ is finite, where the bar denotes the average over the patterns and their corresponding outputs. Moreover, the partition of $V_G$ into connected components may be naturally obtained using the $V_T$'s as elementary "bricks". Indeed, from definition (1), the set of weights $\{J_{\ell i}\}$ contributing to a given $V_T$ is convex (or empty). For the parity machine, two volumes corresponding to two adjacent sets of internal representations (i.e. differing in one single $\tau^\mu_\ell$) cannot coexist (they would give opposite outputs for the pattern $\mu$), and at least one of them must be empty. Thus each connected component of $V_G$ coincides with one and only one volume associated to an internal representation. For the committee machine, a connected component of $V_G$ may include several volumes $V_T$. The labelling of the different subsets of $V_G$ using the internal representations $T$ of the training set may therefore be redundant, depending on the particular decoder under study. It is nevertheless a convenient starting point from the analytical point of view and, as shown below, it does capture the main features of the geometry of the coupling space.
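
To make the definitions above concrete, here is a minimal Python sketch of a tree-like machine with the two decoders. The function names, the tiny example weights and the convention sign(0) = +1 are illustrative assumptions and are not taken from the text.

```python
import math

def internal_representation(weights, pattern):
    """Return [tau_1, ..., tau_K] with tau_l = sign(sum_i J_li * xi_li).

    weights and pattern are lists of K lists (one per hidden unit), each
    holding the N/K entries of one non-overlapping branch of the tree.
    Assumption: a vanishing field is mapped to +1.
    """
    taus = []
    for branch_weights, branch_inputs in zip(weights, pattern):
        field = sum(j * xi for j, xi in zip(branch_weights, branch_inputs))
        taus.append(1 if field >= 0 else -1)
    return taus

def parity_decoder(taus):
    """Parity machine: f = prod_l tau_l."""
    return math.prod(taus)

def committee_decoder(taus):
    """Committee machine: f = sign(sum_l tau_l); K is assumed odd."""
    return 1 if sum(taus) > 0 else -1

def stores_training_set(weights, patterns, outputs, decoder):
    """True if the weights reproduce every output of the training set."""
    return all(
        decoder(internal_representation(weights, xi)) == sigma
        for xi, sigma in zip(patterns, outputs)
    )

# Hypothetical example with K = 3 hidden units and N/K = 2 inputs per unit.
weights = [[1.0, 0.5], [-0.3, 0.8], [0.2, -0.9]]
pattern = [[1, -1], [1, 1], [-1, 1]]
taus = internal_representation(weights, pattern)
```

Flipping a single entry of `taus` always flips the parity output, which is why two adjacent internal representations cannot both store the same pattern; for the committee decoder the output may or may not change.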

The formalism recently introduced for a toy model of MLN can be used to compute the distribution of the "sizes" of the volumes associated to the internal representations $T$. Once the canonical free-energy $g(r) = -\frac{1}{Nr}\,\overline{\log\big(\sum_T V_T^{\,r}\big)}$ is known, one obtains the microcanonical entropy $\mathcal{N}(k)$ (i.e. the logarithm of the typical number, divided by $N$) of volumes $V_T$ whose sizes are equal to $k = \frac{1}{N}\log V_T$ using the Legendre relations $k_r = -\frac{\partial (r\,g(r))}{\partial r}$ and $\mathcal{N}(k_r) = -\frac{\partial g(r)}{\partial (1/r)}$. The average over the patterns is performed using the replica trick for integer $r$, expecting that the final results remain valid for any real value of $r$. There are $r$ blocks ($\rho = 1,\ldots,r$) of $n$ replicas ($a = 1,\ldots,n$). Thus the spin-glass order parameters are the matrices $Q_\ell$ and $\hat{Q}_\ell$ of the typical overlaps $q_\ell^{a\rho,b\lambda}$ (the normalized scalar products $\sum_i J^{a\rho}_{\ell i} J^{b\lambda}_{\ell i}$) between two weight vectors incoming onto the same hidden unit $\ell$ ($\ell = 1,\ldots,K$) and of their conjugate Lagrange multipliers $\hat{q}_\ell^{a\rho,b\lambda}$. Since all the hidden units are indistinguishable, we assume that at the saddle point $Q_\ell = Q$ and $\hat{Q}_\ell = \hat{Q}$, independently of $\ell$. Within the replica symmetric (RS) Ansatz, we find

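$$g(r) = \mathop{\mathrm{Extr}}_{q,\,q_*} \left\{ \frac{1-r}{2r}\log(1-q_*) - \frac{1}{2r}\log\big(1-q_*+r(q_*-q)\big) - \frac{q}{2\big(1-q_*+r(q_*-q)\big)} - \frac{\alpha}{r}\int \prod_\ell Dx_\ell\, \log\mathcal{H}(\{x_\ell\}) \right\} \tag{2}$$

where $\mathcal{H}(\{x_\ell\}) = \mathrm{Tr}_{\{\tau_\ell\}} \prod_\ell \int Dy_\ell\, H\big[(y_\ell\sqrt{q_*-q} + \tau_\ell x_\ell\sqrt{q})/\sqrt{1-q_*}\big]^r$. Here, $q_*(r) = q^{a\rho,a\lambda}$ and $q(r) = q^{a\rho,b\lambda}$ are the typical overlaps between two weight vectors corresponding to the same ($a$, $\rho \neq \lambda$) and to different ($a \neq b$) internal representations $T$ respectively. The Gaussian measure is denoted by $Dx = \frac{dx}{\sqrt{2\pi}}\,e^{-x^2/2}$, whereas the function $H$ is defined as $H(y) = \int_y^\infty Dx$. In eqn. (2), the sum $\mathrm{Tr}_{\{\tau_\ell\}}$ runs over the internal representations $\{\tau_\ell\}$ giving a positive output $f(\{\tau_\ell\}) = +1$ only, since the outputs can always be set equal to $+1$ at the cost of redefining the input patterns.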
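
The Gaussian tail function $H$ and the inner Gaussian integral entering the RS free energy are easy to evaluate numerically. The sketch below assumes that this inner integral has the form $\int Dy\, H[(y\sqrt{q_*-q} + x\sqrt{q})/\sqrt{1-q_*}]^r$; the helper names and the quadrature grid are illustrative choices. It checks one property that follows from the convolution of two Gaussians: for $r = 1$ the integral reduces to $H(x\sqrt{q}/\sqrt{1-q})$, so that $q_*$ drops out and only the usual Gardner overlap $q$ survives.

```python
import math

def H(y):
    """Gaussian tail function H(y) = int_y^infinity Dx."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))

def gauss_average(func, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal estimate of int Dy func(y), Dy being the unit Gaussian measure."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)

def inner_integral(x, q, q_star, r):
    """int Dy H[(y sqrt(q_star - q) + x sqrt(q)) / sqrt(1 - q_star)]^r (assumed form)."""
    a = math.sqrt(q_star - q)
    b = math.sqrt(q)
    c = math.sqrt(1.0 - q_star)
    return gauss_average(lambda y: H((y * a + x * b) / c) ** r)

# At r = 1 the result should not depend on q_star.
x, q = 0.7, 0.3
value = inner_integral(x, q, q_star=0.6, r=1)
reference = H(x * math.sqrt(q / (1.0 - q)))
```

For $r \neq 1$ the power of $H$ no longer commutes with the average over $y$, which is how the internal overlap $q_*$ enters the result.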



The whole distribution of sizes is available through $g(r)$. When $N \to \infty$, $\frac{1}{N}\overline{\log V_G} = -g(r=1)$ is dominated by volumes of size $k_{r=1}$, whose corresponding entropy (i.e. the logarithm of their number divided by $N$) is $\mathcal{N}_D = \mathcal{N}(k_{r=1})$. At the same time, the most numerous volumes are those of smaller size $k_{r=0}$, since in the limit $r \to 0$ all the $T$ are counted irrespective of their relative volumes. Their corresponding entropy $\mathcal{N}_R = \mathcal{N}(k_{r=0})$ is the normalized logarithm of the total number of implementable internal representations. The quantities $\mathcal{N}_D$ and $\mathcal{N}_R$ (which for lack of space we do not write explicitly) are easily obtained from the RS free-energy, eqn. (2), using the above Legendre identities. In particular, $q(r=1)$ is the usual saddle-point overlap of the Gardner volume $g(1)$. The vanishing condition for the entropies should coincide with the zero-volume condition for $V_G$ and thus should give the storage capacity of the models.
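
The Legendre identities relating $g(r)$ to the entropy of volumes follow from a saddle-point evaluation. The following short derivation is a sketch under the assumption that the number of volumes with $\frac{1}{N}\log V_T = k$ scales as $\exp(N\mathcal{N}(k))$, and it ignores the subtleties of the average over the patterns.

```latex
% Assumption: the number of volumes of size k scales as exp(N \mathcal{N}(k)).
\begin{align*}
\sum_T V_T^{\,r} &\simeq \int dk \; e^{N[\mathcal{N}(k) + r k]}
  \quad\Longrightarrow\quad
  -r\,g(r) = \max_k \big[\mathcal{N}(k) + r k\big], \\
k_r &= -\frac{\partial\,(r\,g(r))}{\partial r}, \qquad
  \left.\frac{d\mathcal{N}}{dk}\right|_{k_r} = -r, \\
\mathcal{N}(k_r) &= -r\,g(r) - r\,k_r
  = r^2\,\frac{\partial g}{\partial r}
  = -\frac{\partial g}{\partial (1/r)} .
\end{align*}
```

Setting $r = 1$ selects the volumes dominating $V_G$, while $r \to 0$ weights all non-empty volumes equally.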

Both $\mathcal{N}_D$ and $\mathcal{N}_R$ have a straightforward interpretation in the context of information theory. One can easily verify that the quantity of information $I$ carried by the distribution of the implementable internal representations $T$ about the weights, $I = -\frac{1}{N}\sum_T \frac{V_T}{V_G}\log\frac{V_T}{V_G}$, is equal to $\mathcal{N}_D$. The information capacity, i.e. the maximal quantity of information one can extract from the internal representations, is achieved when all internal representations $T$ are equiprobable and thus equals $\mathcal{N}_R$. One should notice that the Mitchison-Durbin geometrical calculation is simply an upper (and decoder-independent) bound on $\mathcal{N}_R$.
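
The identification of the information $I$ with the entropy of the dominant volumes can be checked on a toy list of volumes. The sketch below is only an illustration: it sets $N = 1$, uses a hypothetical list of four volumes, and compares $-\sum_T p_T \log p_T$ (with $p_T = V_T/V_G$) to $r^2\,\partial g/\partial r$ at $r = 1$ computed by a finite difference.

```python
import math

def information(volumes):
    """Shannon entropy -sum_T p_T log p_T of the weights p_T = V_T / V_G."""
    total = sum(volumes)
    return -sum(v / total * math.log(v / total) for v in volumes if v > 0)

def free_energy(volumes, r):
    """Toy version of g(r) = -(1/r) log sum_T V_T^r, with N set to 1."""
    return -math.log(sum(v ** r for v in volumes if v > 0)) / r

def dominant_entropy(volumes, eps=1e-5):
    """r^2 dg/dr evaluated at r = 1 by a central finite difference."""
    return (free_energy(volumes, 1 + eps) - free_energy(volumes, 1 - eps)) / (2 * eps)

# Hypothetical volumes of four internal representations.
volumes = [0.5, 0.3, 0.15, 0.05]
info = information(volumes)
capacity = math.log(len(volumes))  # reached only for equal volumes
```

The information never exceeds the logarithm of the number of non-empty volumes, and equals it only when all volumes are identical.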

Let us now turn to the physical and geometrical interpretation of $\mathcal{N}_D$. Fig. 1 displays the RS entropy $\mathcal{N}_D$ as a function of $\alpha$ for both the parity and committee machines with $K = 3$ hidden units. This entropy vanishes at a critical value $\alpha_D$ of the size of the training set. Numerically, we find $\alpha_D \simeq 3.8$ and $2.9$ for the parity and the committee machines respectively. For comparison, the storage capacities obtained with the one-step RSB Ansatz are $\alpha_c \simeq 5$ and $3$ respectively. Being the entropy of a discrete system, $\mathcal{N}_D$ cannot be negative, and therefore $\alpha_D$ is an upper bound of the size of the training set $\alpha_{RSB}$ where replica symmetry breaking occurs for both $\mathcal{N}_D$ and $V_G$. It is indeed known that $\alpha_{RSB} = 3.2$ and $1.8$ for the parity and the committee machines respectively. When $\alpha < \alpha_{RSB}$, the RS assumption is exact whereas $\mathcal{N}_D$ is positive, showing that the number of internal representation volumes contributing to $V_G$ is exponentially large with $N$. The overlap $q_*$ measures the typical overlap inside one of these volumes, while the usual overlap $q$ arising in the RS computation of $V_G$ tells us how far apart two different volumes $V_T$ are. The behaviour of $q_*$ versus $\alpha$ is shown in the inset of figure 1. When choosing randomly two weight vectors storing the training set, the probability that they belong to the same $V_T$ vanishes as $\exp(-N\mathcal{N}_D)$, and their overlap distribution cannot be told from a Dirac peak in $q$, as it must be for the RS solution to be exact. As a consequence, the blind computation of $V_G$, though it gives correct results, hides the geometrical structure of the weight space. In the limit of a large number $K$ of hidden units, the asymptotic expressions of the overlaps $q$ and $q_*$ may be obtained analytically.
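We find that $q = 0$ and $q \simeq 1 - \ldots$ for the parity and the committee machines respectively, and that $q_* \simeq 1 - \ldots$ (a correction of order $1/(\alpha^2 K^2)$) in both cases, with $\Gamma = -1/\big(\sqrt{\pi}\int du\, H(u)\log H(u)\big) \simeq 0.62$. The corresponding entropies $\mathcal{N}_D^{\rm par} \simeq \log K - \alpha\log 2$ and $\mathcal{N}_D^{\rm com} \simeq \log K - \ldots$ vanish at $\alpha = \log K/\log 2$ and $\alpha \simeq \frac{16}{\pi}\sqrt{\log K}$ respectively.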
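
The numerical value of the constant $\Gamma$ can be reproduced with a simple quadrature. The following sketch uses only the Python standard library; the integration window and step count are illustrative choices, and the integrand $H(u)\log H(u)$ decays quickly at both ends, so a trapezoidal rule is sufficient.

```python
import math

def H(u):
    """Gaussian tail function H(u) = int_u^infinity Dx."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def gamma_constant(lo=-12.0, hi=12.0, steps=24000):
    """Gamma = -1 / (sqrt(pi) * int du H(u) log H(u)), by the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        p = H(u)
        # H(u) log H(u) tends to 0 at both ends; guard against log(0).
        term = p * math.log(p) if p > 0.0 else 0.0
        total += (0.5 if i in (0, steps) else 1.0) * term
    integral = total * h
    return -1.0 / (math.sqrt(math.pi) * integral)

gamma = gamma_constant()
```

The result agrees with the value of about 0.62 quoted in the text.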

When $\alpha > \alpha_{RSB}$, the computation of $\mathcal{N}_D$ requires the introduction, at the first stage of RSB, of four order parameters $q_*, q_0, q_1, m$: $q_*$ is the internal overlap of the internal representation volumes, and $q_0, q_1, m$ are simply the usual parameters arising in the one-step Gardner computation. For brevity we only present below our numerical results together with their geometrical interpretation. Above $\alpha_{RSB}$, there exists a finite number of big regions with mutual overlap $q_0$. Each region $\rho$ contains an exponential number $M_\rho$ of volumes of internal overlap $q_*$, typically separated by an overlap $q_1$. The number of such regions may be roughly estimated by $\frac{1}{1-m}$, since $m = 1 - \sum_\rho \big(M_\rho/\sum_{\rho'} M_{\rho'}\big)^2$, whereas in the RS phase $m = 1$. We have checked numerically this geometrical scenario for the parity machine with $K = 3$ hidden units (numerically much simpler than the committee machine case since $q_0 = 0$ at the saddle point). The internal overlap $q_*$ is continuous at the RSB transition (see the inset of fig. 1). We conjecture that, on increasing $\alpha$, a whole continuous breaking of replica symmetry occurs. The geometrical process should then be thought of as a progressive shrinking and disappearance of volumes with internal overlap $q_*(\alpha)$ inside sub-regions characterized by $q(x,\alpha)$. In fig. 1, we have reported the curve of $\mathcal{N}_D$ computed with this one-step Ansatz for the parity machine with $K = 3$. $\alpha_D$ increases from $\simeq 3.8$ (RS value) to a value close to 5, and thus to the one-step RSB value of $\alpha_c$ (since $q_*$ and $q_1$ are close to 1, our numerical results become less precise for $\alpha$ larger than $\simeq 4.1$, and $\alpha_D \simeq 5$ is obtained through the linear extrapolation corresponding to the dashed part of the curve).
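
The rough estimate of the number of regions is a participation-ratio argument, which the following sketch illustrates. It assumes the relation $m = 1 - \sum_\rho (M_\rho/\sum_{\rho'} M_{\rho'})^2$ read from the text; the function names and the example region sizes are hypothetical.

```python
def rsb_parameter_m(region_sizes):
    """m = 1 - sum_rho (M_rho / sum_rho' M_rho')^2 (relation assumed from the text)."""
    total = sum(region_sizes)
    return 1.0 - sum((size / total) ** 2 for size in region_sizes)

def effective_number_of_regions(region_sizes):
    """Rough estimate 1 / (1 - m) of the number of big regions."""
    return 1.0 / (1.0 - rsb_parameter_m(region_sizes))

# Four regions of equal size are counted as four.
equal = effective_number_of_regions([5.0, 5.0, 5.0, 5.0])
# One dominant region and a small one are counted as slightly more than one.
unequal = effective_number_of_regions([9.0, 1.0])
```

With this definition, $m \to 1$ corresponds to infinitely many regions of comparable weight, and $m \to 0$ to a single dominant region.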

The RS calculation of $\mathcal{N}_R$ for both machines leads to the following general results. When $\alpha < 2/K$, one finds that all the $2^{P(K-1)}$ internal representations may be implemented. This obviously coincides with the storage capacity of the hidden perceptrons, each seeing only $N/K$ input units. For $\alpha > 2/K$, we find that at the saddle point $q_* = 1$, meaning that the most numerous volumes $V_T$ are almost empty and are therefore the smallest ones at the same time. The resolution of the saddle-point equations requires the introduction of a new order parameter $\mu = \lim_{r\to 0} r/(1-q_*)$, describing how quickly the typical size of the volumes decreases with respect to the inverse "temperature" $r$. For the parity machine with $K > 3$, $q = 0$, $\mu = \alpha K(\alpha K - 2)$ is always a locally stable saddle point, giving $\mathcal{N}_R^{\rm par} = \alpha K\log(\alpha K) - (\alpha K - 1)\log(\alpha K - 1) - \alpha\log 2$, which exactly saturates the upper bound derived by Mitchison and Durbin. In the case of the committee machine, a simple analytical expression for $\mathcal{N}_R$ is not available for finite $K$. Once more in fig. 1, we report the numerical results concerning the RS calculations of $\mathcal{N}_R$ for both machines with $K = 3$. The value $\alpha_R$ at which $\mathcal{N}_R$ vanishes should satisfy the obvious inequality $\alpha_D \le \alpha_R \le \alpha_c$; the RS approximation however overestimates $\alpha_R$, leading to a value which is slightly larger than the one-step value of $\alpha_c$. For the parity and committee machines with $K = 3$ we find $\alpha_R = 5.4$ and $3.5$ respectively. This is evidence for the necessity of RSB to compute $\mathcal{N}_R$ exactly for finite $K$.
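
The point where the closed-form parity entropy vanishes can be located numerically. The sketch below assumes the expression $\mathcal{N}_R^{\rm par} = \alpha K\log(\alpha K) - (\alpha K - 1)\log(\alpha K - 1) - \alpha\log 2$ for $\alpha K \ge 2$ and finds its zero by bisection; the function names and the bracketing interval are illustrative, and applying the closed form at $K = 3$ is itself an assumption.

```python
import math

def n_r_parity(alpha, K):
    """Closed-form RS entropy of implementable representations (assumed for alpha*K >= 2)."""
    x = alpha * K
    return x * math.log(x) - (x - 1.0) * math.log(x - 1.0) - alpha * math.log(2.0)

def alpha_r_parity(K, upper=50.0, tol=1e-10):
    """Zero of n_r_parity in alpha, by bisection on [2/K, upper].

    Assumption: the entropy is positive at alpha = 2/K and negative at
    alpha = upper, which holds for moderate K with the default bracket.
    """
    lo, hi = 2.0 / K, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n_r_parity(mid, K) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha_r = alpha_r_parity(3)
```

For $K = 3$ the zero lands close to the RS value of 5.4 quoted above for the parity machine.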

When $K \gg 1$, $\mathcal{N}_R$ (resp. $\alpha_R$) is asymptotically equal to $\mathcal{N}_D$ (resp. $\alpha_D$). In the case of the parity machine, $\alpha_R$ and $\alpha_D$ also coincide with the known value of $\alpha_c = \log K/\log 2$. We expect the same equality $\alpha_R = \alpha_D = \alpha_c$ ($= \frac{16}{\pi}\sqrt{\log K}$) to hold in the case of the committee machine. In order to show that the RS solution of $\mathcal{N}_R$ is asymptotically correct, we have checked its local stability with respect to fluctuations of the order parameter matrices. Although it would require a complete analysis of the eigenvalues of the Hessian matrix, we have focused only on the replicons 011 and 122, in the notations of the literature, which are usually the most "dangerous" modes. For a free-energy functional depending only on one order parameter matrix $q^{a\rho,b\lambda}$, the corresponding eigenvalues $\Lambda_{011}$ and $\Lambda_{122}$ are given by formula (41) of the cited reference. In our case, however, the free-energy depends on $2K$ matrices $\{Q_\ell, \hat{Q}_\ell\}$, and the stability condition for each mode reads $\Lambda(\alpha,K) = \hat{\Lambda}\,\big(\Lambda + (K-1)\Lambda'\big) - 1 < 0$, where $\hat{\Lambda}$, $\Lambda$ and $\Lambda'$ are the eigenvalues computed for the fluctuations with respect to $\hat{Q}_\ell\hat{Q}_\ell$, $Q_\ell Q_\ell$ and $Q_\ell Q_m$ ($m \neq \ell$) respectively. A tedious calculation leads to the final expressions of $\Lambda_{011}$ and $\Lambda_{122}$. For the parity machine, we find $\Lambda_{011}^{\rm par}(\alpha,K) = \ldots$ and $\Lambda_{122}^{\rm par}(\alpha,K) = \ldots$, which are valid for $K > 3$ and $\alpha > 2/K$. The RS solution is unstable against the 011 replicon mode for $\alpha > \ldots$ (i.e. of the same order as the storage capacity of each single input perceptron). However, in the large $K$ limit, $\Lambda_{011}$ vanishes. For the committee machine, one finds $\Lambda_{011}^{\rm com}(\alpha,K) \simeq \ldots$ and $\Lambda_{122}^{\rm com}(\alpha,K) \simeq \ldots$ for $K \gg 1$ and $\alpha \gg 1$. We notice that the 122 mode is always stable, and a unique order parameter $q_*$ is thus sufficient to describe the volume associated to a set of internal representations $T$. For both machines, our RS solution is marginally stable when $K \to \infty$ and should therefore become exact in this limit.

In order to understand the consequences of the weight space structure for the generalization ability of MLN, we now adapt our approach to the case of deterministic input-output mappings.

The case of the parity machine trained on a learnable rule (i.e. generated by a "teacher" network endowed with an identical architecture) has been recently studied in the Bayesian framework, where the generalization properties are derived through the knowledge of the entropy $S_G = -\frac{1}{N}\sum_{\{\sigma^\mu\}} V_G\log V_G$. The transition from high generalization error $\epsilon_g = \frac{1}{2}$ to low $\epsilon_g$ ($= \Gamma/\alpha$ for large $\alpha$) may be geometrically understood along the lines developed above. The free-energy $s(r)$ generating the distribution of the "sizes" of the internal representation volumes $V_T$ becomes $s(r) = -\frac{1}{Nr}\sum_{\{\sigma^\mu\}} V_G\log\big(\sum_T V_T^{\,r}\big)$, where we obviously recover $S_G = s(1)$. The replica calculation of $s(r)$ technically differs from the computation of $g(r)$ by taking the limit $n \to 1$ instead of $n \to 0$. Within the RS Ansatz, we find



$$s(r) = \mathop{\mathrm{Extr}}_{q} \left\{ \frac{1-r}{2r}\log(1-q_*) - \frac{1}{2r}\log\big(1-q_*+r(q_*-q)\big) - \frac{q}{2\big(1+(r-1)q_*\big)} - \frac{2\alpha}{r\,I(q_*)}\int \prod_\ell Dx_\ell\, \mathcal{H}(\{x_\ell\})\log\mathcal{H}(\{x_\ell\}) \right\} \tag{3}$$



with $I(q_*) = \big[2\int Dz\, H\big(z\sqrt{q_*}/\sqrt{1-q_*}\big)^r\big]^K$, and $q_*(r)$ is the saddle-point overlap of $s_0(r) = (1-r)\log(1-q_*) - \log(1+(r-1)q_*) - 2\alpha\log I(q_*)$. The logarithm $\mathcal{N}_D$ of the number of the internal representation volumes contributing to the Bayesian entropy $S_G$ is given by $\mathcal{N}_D = \frac{\partial s}{\partial r}(r=1)$. We find that for $\alpha K \gg 1$, $q_* \simeq 1 - \ldots$. In the case of the parity machine, $q = 0$, $\mathcal{N}_D^{\rm par} \simeq \log K - \alpha\log 2$ for $\alpha < \alpha_0 = \log K/\log 2$, and $q = q_*$, $\mathcal{N}_D^{\rm par} = 0$ for $\alpha > \alpha_0$. Thus, below $\alpha_0$, the weight space is composed of an exponentially large number of volumes, and the typical overlap $q$ between the volume occupied by the teacher and any other one is zero: $\epsilon_g = \frac{1}{2}$. Above $\alpha_0$, since only one internal representation survives, the student has fallen into the teacher volume: $q = q_*$ and $\epsilon_g \simeq \Gamma/\alpha$. When $\alpha < \alpha_0$, $S_G^{\rm par} = \alpha\log 2$, meaning that all the sets of $P$ outputs are equiprobable. Choosing them with a probability $V_G(\{\sigma^\mu\})$ is then equivalent to drawing them randomly. This is the reason why $\alpha_D$ defined for the storage problem (and more generally $d_{VC}$) appears on the generalization curve of the parity machine. Our calculation also indicates that the computation of $\alpha_0$ should include RSB effects for finite $K$, while the asymptotic RS expression of $\epsilon_g$ ought to be exact, as has been found for the non-monotonic perceptron.

Turning to the committee machine, a calculation of the Bayesian entropy $S_G$ similar to the one above leads to the following results when $K \gg \alpha \gg 1$. The typical teacher-student overlap $q$ decreases as $1 - \ldots$, giving an entropy $S_G^{\rm com} \simeq 2\log\alpha$ and $\epsilon_g \simeq \ldots$. This shows that, at variance with the parity machine case, only a small fraction among the $2^P$ possible sets of outputs contribute to $S_G^{\rm com}$, and explains why the generalization curve is smooth for $\alpha \sim \sqrt{\log K}$ (which is the order of magnitude of $d_{VC}$). We find $\mathcal{N}_D^{\rm com} \simeq \log K - \log\alpha$, confirming that $\alpha_D$ (and thus $d_{VC}$) is not relevant to the computation of the typical generalization error. At $\alpha_{c.o.} \simeq K$ only a single internal representation subsists, and beyond this critical size of the training set the generalization error should equal $\epsilon_g = \Gamma/\alpha$, as is the case for finite $K$ and large $\alpha$. Note that the order of magnitude of $\alpha_{c.o.}$ is corroborated by the condition $q = q_*$ one has to fulfill once a unique $V_T$ remains non-empty. A rigorous proof of the presence of this cross-over (from $\epsilon_g \simeq \ldots$ to $\epsilon_g = \Gamma/\alpha$) at $\alpha_{c.o.}$ would however require extending the validity of our calculation to the regime $1 \ll \alpha \sim K$.

We are grateful to N. Brunel, M. Budinich, M. Ferrero and D. O'Kane for discussions.

FIG. 1. $\mathcal{N}_R$ (upper curves) and $\mathcal{N}_D$ (lower curves) for the parity machine (bold) and the committee machine (light), with $K = 3$ hidden units. Inset: three overlaps (lower, middle and upper curves, $q_*$ being the middle one) versus $\alpha$ for the parity machine ($q_1$ starts at $\alpha = \alpha_{RSB} \simeq 3.2$ with a value close to 0.93).